This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature investigating bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nanomaterials were characterised and to evaluate the quality of the articles.
This report contains preliminary analyses and an exploration of the data in the corpus of texts. The first goal of these analyses is to gain some understanding of the structure of the texts in the corpus of articles and of how the lemmas “material(s)” and “method(s)” relate to this corpus.
The second goal is to investigate how to detect the beginning of the “Materials and Methods” section. The main difficulty in identifying the start of this section is that these two words can also appear in the body of the article (typically “cf. Materials and Methods”).
The corpus was created from the folder “Full Text dev set”, which contains 751 articles converted to txt format. The other articles are kept unseen, to test the efficacy of any tools developed later in “real-life conditions”.
A few definitions to frame the problem:
Token: a word form or punctuation symbol. “,” and “(” are tokens, and so are “and” and “method”.
Lemma: the lemma or stem of a word form. For example, the tokens “Materials” and “materials” have the same lemma, “materials”.
Head: the head of the current word, given either as a token_id value or as zero.
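To make these definitions concrete, here is a minimal sketch on a hand-made annotation of the phrase “Materials and Methods”. The data frame `toy` and its values are illustrative assumptions, not rows from the corpus; only the column names follow the annotation used in this report.

```r
# Toy annotation of "Materials and Methods" (illustrative values only).
toy <- data.frame(
  doc_id        = "docA",
  sentence_id   = 1,
  token_id      = 1:3,
  token         = c("Materials", "and", "Methods"),
  lemma         = c("materials", "and", "methods"),
  head_token_id = c(0, 3, 1),
  stringsAsFactors = FALSE
)

# A token whose head_token_id is 0 is the root of its sentence.
root_tokens <- toy$token[toy$head_token_id == 0]

# For every other token, head_token_id is the token_id of its head.
head_of <- toy$token[match(toy$head_token_id, toy$token_id)]
names(head_of) <- toy$token
```

In this sketch `head_of` is NA for “Materials”, because a head_token_id of 0 does not point to any token of the sentence.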
A quick exploratory data analysis of the article Abrams, MT et al., 2010 suggested that the “materials” token of the Materials and Methods section has a specific property: its head_token_id is equal to zero, i.e. the word is the root of its sentence, which this report treats as the word being its own “head” (cf. the example below). This suggested that the section titles of articles may share this property. This hypothesis is tested in the first part of this report and, in a later section, for the lemmas “materials” and “material” (Co-occurrences for materials and material when their head_token_id = 0).
In the later sections, we try different criteria to isolate some of the lemmas “materials”, “material”, “methods” and “method”. We use co-occurrence analysis to explore the surroundings of these lemmas in the text and to evaluate whether each criterion allows us to discriminate the beginning of the Materials and Methods section from the rest of the article.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to create informal reports describing data analysis projects as a web page, and a good way to mix code and description in a readable manner. There are even books in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and reuse this work. Because this report is also code, it can be recompiled with new data (including another model for the annotation of the corpus).
library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)
The following lines load the corpus of text, already annotated and tokenized :
x <- readRDS(file = "annotation_lines.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751
Here is an example of a “materials” token with head_token_id = 0:
x[7467,]
## doc_id paragraph_id sentence_id
## 7467 doc1 602 841
## sentence
## 7467 siRNA concentration in tissues was determined using a modified stem-loop RT-PCR protocol.21 Samples preserved in RNAlater (Qiagen, Valencia, CA) were homogenized in Trizol buffer (Qiagen) at 20 µg/µl in a bead mill.
## token_id token lemma upos xpos feats head_token_id dep_rel
## 7467 30 Qiagen Qiagen PROPN SG-NOM Number=Sing 27 appos
## deps misc
## 7467 <NA> SpaceAfter=No
Given the observation that, in “Materials and Methods”, the head_token_id was 0 for the token “Materials”, one idea was to explore which lemmas in the corpus most commonly have a head_token_id equal to zero.
The expected outcome of this analysis would be to retrieve the usual section titles of scientific articles, such as Abstract or Results, among the most common words. The goal is to assess whether this is a consistent property of section titles within the articles, and to uncover potential synonyms of “Materials and Methods” such as “Experimental section”.
stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")
Nonetheless, this assumption appears to have been quite naive, as a lot of tokens have this property. Let’s filter for specific lemmas that correspond to usual section titles, such as abstract or results:
stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")
stats
## key freq freq_pct
## 1 result 1889 0.3692344981
## 2 method 379 0.0740814583
## 3 discussion 230 0.0449570855
## 4 introduction 166 0.0324472878
## 5 material 121 0.0236513363
## 6 methods 42 0.0082095547
## 7 abstract 20 0.0039093118
## 8 results 1 0.0001954656
Some section titles seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in the corpus (751). Taking the token “discussion” as an example: either some articles do not have a Discussion section or, more probably, the token “discussion” does not always have the property mentioned earlier. We can answer this question:
occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 900
length(unique(x[occurrences,]$doc_id))
## [1] 709
There are 900 occurrences of the word “discussion” in the whole corpus, and 709 articles contain this word. Since only 230 of these occurrences have a head_token_id of zero, it seems very likely that a head_token_id of zero alone is not sufficient to identify the tokens that are section titles.
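The comparison above can also be made per document rather than per occurrence. The following is a minimal sketch on made-up data (the data frame `toy` and the variable names are assumptions for illustration); it computes the share of documents containing a lemma in which that lemma appears at least once with a head_token_id of zero.

```r
# Toy annotation: three documents, illustrative values only.
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  lemma         = c("discussion", "discussion", "discussion", "result", "discussion"),
  head_token_id = c(0, 4, 2, 0, 0),
  stringsAsFactors = FALSE
)

is_target <- toy$lemma == "discussion"

# Documents that contain the lemma at all.
docs_with_lemma <- unique(toy$doc_id[is_target])

# Documents in which the lemma is the root of at least one sentence.
docs_with_root <- unique(toy$doc_id[is_target & toy$head_token_id == 0])

# Share of documents for which the head_token_id == 0 criterion finds the lemma.
coverage <- length(docs_with_root) / length(docs_with_lemma)
```

Applied to the real annotation `x`, a coverage well below 1 would support the conclusion that this criterion alone misses many section titles.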
To explore the relationships of the lemmas “material(s)” and “method(s)” with the rest of the corpus, we can analyse which head tokens are the most recurrent for the lemmas “material” and “materials”. The goals of the analysis are :
grep_lemma_head_token_id <- function(index){
#catch the lemma corresponding to the head_token_id of the token at the entry "index" of x
#x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
head_token_id<-occurrence$head_token_id
head_token_id<-as.numeric(head_token_id)
sentence_id<-occurrence$sentence_id
doc_id<-occurrence$doc_id
#the following line query the lemma of the head_token_id based on the previous parameters
lemma_head_token_id<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[head_token_id],]$lemma
if (head_token_id==0) {lemma_head_token_id=occurrence$lemma}
return(lemma_head_token_id)
}
material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")
occurrences<-which(x$lemma=="materials")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")
occurrences<-which(x$lemma=="method")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")
occurrences<-which(x$lemma=="methods")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")
head(stats, 10)
## head_token_lemmas Freq key
## 59 MATERIALS 72 MATERIALS
## 37 describ 42 describ
## 63 methods 42 methods
## 10 and 22 and
## 48 Immunol 22 Immunol
## 106 use 13 use
## 57 material 11 material
## 58 Material 9 Material
## 51 j 5 j
## 88 section 5 section
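The lookup performed by grep_lemma_head_token_id goes through the occurrences one by one. As a possible alternative, the sketch below does the same lookup for all tokens at once by matching keys built from doc_id, sentence_id and token_id. The data frame `toy` is made up for illustration, and this is not the code used to produce the figures above.

```r
# Toy annotation: two one-sentence documents, illustrative values only.
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id   = c(1, 1, 1, 1, 1),
  token_id      = c(1, 2, 3, 1, 2),
  lemma         = c("materials", "and", "methods", "see", "methods"),
  head_token_id = c(0, 3, 1, 0, 1),
  stringsAsFactors = FALSE
)

# One key per token and one key per head; match() finds the row of each head.
token_key  <- paste(toy$doc_id, toy$sentence_id, toy$token_id, sep = "|")
head_key   <- paste(toy$doc_id, toy$sentence_id, toy$head_token_id, sep = "|")
head_lemma <- toy$lemma[match(head_key, token_key)]

# Root tokens (head_token_id == 0) have no head row; as in
# grep_lemma_head_token_id, they are given their own lemma.
is_root <- toy$head_token_id == 0
head_lemma[is_root] <- toy$lemma[is_root]

# Frequency of the head lemmas for the lemma "methods".
head_freq <- table(head_lemma[toy$lemma == "methods"])
```

Matching on token_id rather than on the position within the sentence avoids relying on the tokens of a sentence being numbered consecutively from 1.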
In the next sections we test different criteria to discriminate the lemmas “materials” and “material” within the articles. The idea is to find a criterion that identifies the beginning of the “Materials and Methods” section.
Co-occurrence analysis shows how words are used together, either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas we isolated with each criterion.
There are several types of co-occurrence analysis:

* Looking at which words are located in the same document/sentence/paragraph.
* Looking at which words are followed by another word.
* Looking at which words are in the neighbourhood of a word, i.e. follow it within skipgram number of words.

See the documentation of the udpipe package for the three possible uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to get an idea of the more distant or more proximal neighbourhood.
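To illustrate the second approach, here is a minimal base-R sketch that counts how often one lemma is directly followed by another. It is only an approximation of the idea on a made-up vector of lemmas; the udpipe function used in this report may differ in details, and the names `lemmas`, `pairs` and `cooc` are chosen for illustration.

```r
# Made-up sequence of lemmas, in reading order.
lemmas <- c("material", "and", "method", ".", "material", "and", "method", "2.1")

# Pair each lemma with the one that immediately follows it.
pairs <- data.frame(
  term1 = head(lemmas, -1),
  term2 = tail(lemmas, -1),
  stringsAsFactors = FALSE
)

# Count identical pairs, then sort by decreasing count.
pairs$cooc <- 1
cooc <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
cooc <- cooc[order(cooc$cooc, decreasing = TRUE), ]
```

The result has the same three columns (term1, term2, cooc) as the tables shown in the rest of this report, which makes them easier to read: each row says how many times term2 directly follows term1.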
The two functions below are meant to save space in the document. The first one plots the word network, a common technique for visualising word co-occurrences, after filtering the co-occurrences to keep only those that involve the lemma of interest. The second one returns the first 30 of these co-occurrences as a table.
plot_cooccurrence <- function(stats, lemma, title){
#function to gain place and make this Rmarkdown document more clear
stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
wordnetwork <- head(stats, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
geom_node_text(aes(label = name), col = "blue", size = 5) +
theme_graph(base_family = "Helvetica") +
theme(legend.position = "none") +
labs(title = title)
}
head_cooc <- function(stats, lemma){
#function to gain place and make this Rmarkdown document more clear
stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)
Larger skipgram values were not really relevant. Here we can simply read the rows of the data frame stats to see how many times each word follows another.
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 contact materials 1
## 2 materials 22 1
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material . 401
## 2 material be 341
## 3 material , 323
## 4 the material 222
## 5 of material 214
## 6 this material 208
## 7 supplementary material 203
## 8 material and 173
## 9 material in 148
## 10 material ( 146
## 11 material at 131
## 12 t material 126
## 13 test material 80
## 14 material for 75
## 15 bulk material 67
## 16 material that 59
## 17 material have 58
## 18 nanotube material 56
## 19 and material 55
## 20 foreign material 54
## 21 material with 53
## 22 reference material 48
## 23 material : 43
## 24 material to 38
## 25 genetic material 38
## 26 material on 34
## 27 material the 32
## 28 material can 32
## 29 . material 31
## 30 material [ 30
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 and methods 143
## 2 methods . 55
## 3 in methods 45
## 4 . methods 29
## 5 Immunol methods 25
## 6 methods Material 20
## 7 , methods 17
## 8 see methods 17
## 9 methods ) 17
## 10 methods for 15
## 11 methods to 11
## 12 use methods 10
## 13 methods , 10
## 14 methods t 9
## 15 methods Chemical 9
## 16 methods and 9
## 17 methods Animal 8
## 18 Mech methods 8
## 19 methods section 8
## 20 methods Preparation 7
## 21 methods Nanoparticle 6
## 22 methods cell 5
## 23 methods Mol 4
## 24 methods 2.1 4
## 25 test methods 4
## 26 supplementary methods 4
## 27 methods material 3
## 28 methods Synthesis 3
## 29 methods the 3
## 30 methods NP 3
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 and method 518
## 2 method . 502
## 3 method for 469
## 4 the method 449
## 5 . method 407
## 6 method of 318
## 7 method be 280
## 8 method , 279
## 9 method to 220
## 10 method ( 215
## 11 this method 151
## 12 method 2.1 133
## 13 method and 133
## 14 method use 129
## 15 method describ 127
## 16 method : 116
## 17 a method 94
## 18 method in 83
## 19 method [ 79
## 20 ) method 65
## 21 method have 61
## 22 test method 51
## 23 method as 51
## 24 method ) 49
## 25 method Material 47
## 26 sensitive method 46
## 27 method with 44
## 28 method Mol 41
## 29 method Animal 38
## 30 vitro method 38
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 and method 518
## 2 method . 502
## 3 method for 469
## 4 the method 449
## 5 . method 407
## 6 material . 401
## 7 material be 341
## 8 material , 323
## 9 method of 318
## 10 method be 280
## 11 method , 279
## 12 the material 222
## 13 method to 220
## 14 method ( 215
## 15 of material 214
## 16 this material 208
## 17 supplementary material 203
## 18 material and 173
## 19 this method 151
## 20 material in 148
## 21 material ( 146
## 22 and methods 143
## 23 method 2.1 133
## 24 method and 133
## 25 material at 131
## 26 method use 129
## 27 method describ 127
## 28 t material 126
## 29 method : 116
## 30 a method 94
As in the previous approach, we want to explore the relationships of the different lemmas with their neighbourhood in the corpus, but we restrict the analysis to sentences in which the lemma “material” or “materials” is its own head token.
Even if not every “Materials and Methods” section title has a “materials” lemma with a head_token_id equal to zero, the converse could be true: a “materials” lemma with a head_token_id of zero may always belong to such a title.
Here, by restricting to the lemmas “materials” and “material” that have a head_token_id = 0, we can visualise their statistical association with other words and understand whether this subset of tokens really delimits the beginning of the “Materials and Methods” section.
The first function filters for sentences in which the lemma “material” or “materials” is the head. The following lines compute the co-occurrences and draw the plot as before.
create_subset_corpus<- function(index){
#this function is aimed to help construct a subset of x for the part of the analysis :
#Co-occurences for materials and material when their head_token_id = 0
#x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
sentence_id<-occurrence$sentence_id
doc_id<-occurrence$doc_id
#the following lines collect the head_token_id and test if is equal to zero
#if so, its output the tokens of the sentences
head_token_id<-occurrence$head_token_id
if (head_token_id==0) {return(strip_corpus(doc_id, sentence_id))}
return()
}
strip_corpus <- function(doc_id, sentence_id){
#this function returns all the lemma of a sentence, in the appropriate format
#the purpose of doing so is to allow for calculation of cooccurence of words inside this sentences
#for this we need all the elements of the sentence
sentence_id<-as.numeric(sentence_id)
subset_article<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id),]
return(subset_article)
}
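The same filter can be sketched without looping over occurrences, by building one key per sentence and keeping every row whose sentence contains the target lemma as root. The data frame `toy` below is made up for illustration, and the variable names are assumptions, not part of the original analysis.

```r
# Toy annotation: one document with two sentences, illustrative values only.
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc1", "doc1", "doc1"),
  sentence_id   = c(1, 1, 1, 2, 2, 2),
  lemma         = c("material", "and", "method", "the", "material", "be"),
  head_token_id = c(0, 3, 1, 2, 3, 0),
  stringsAsFactors = FALSE
)

# One key per sentence (sentence_id alone is not unique across documents).
sentence_key <- paste(toy$doc_id, toy$sentence_id, sep = "|")

# Rows where the target lemma is the root of its sentence.
is_root_target <- toy$lemma == "material" & toy$head_token_id == 0

# Keep all tokens of the sentences that contain such a row.
subset_toy <- toy[sentence_key %in% sentence_key[is_root_target], ]
```

Here only the first sentence is kept: in the second one, “material” is present but its head_token_id is not zero.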
occurrences<-which(x$lemma=="materials")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
# The next three lines are commented out, so `stats` is not recomputed here:
# the plot and table below still show the co-occurrences computed on the whole corpus.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when its head_token_id is equal to 0")
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 and method 518
## 2 method . 502
## 3 method for 469
## 4 the method 449
## 5 . method 407
## 6 material . 401
## 7 material be 341
## 8 material , 323
## 9 method of 318
## 10 method be 280
## 11 method , 279
## 12 the material 222
## 13 method to 220
## 14 method ( 215
## 15 of material 214
## 16 this material 208
## 17 supplementary material 203
## 18 material and 173
## 19 this method 151
## 20 material in 148
## 21 material ( 146
## 22 and methods 143
## 23 method 2.1 133
## 24 method and 133
## 25 material at 131
## 26 method use 129
## 27 method describ 127
## 28 t material 126
## 29 method : 116
## 30 a method 94
occurrences<-which(x$lemma=="material")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 supplementary material 26
## 2 material . 17
## 3 t material 14
## 4 material and 12
## 5 reference material 7
## 6 material ( 6
## 7 . material 6
## 8 copyrighted material 6
## 9 material supplementary 5
## 10 material in 4
## 11 material for 4
## 12 material , 4
## 13 important material 4
## 14 material that 3
## 15 material material 3
## 16 material with 3
## 17 particulate material 3
## 18 composite material 3
## 19 material t 3
## 20 material ; 3
## 21 mesoporous material 3
## 22 material as 3
## 23 material of 3
## 24 material : 2
## 25 material available 2
## 26 material within 2
## 27 stent material 2
## 28 Mesoporous material 2
## 29 Nature material 2
## 30 nanotube material 2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 supplementary material 26
## 2 material . 17
## 3 t material 14
## 4 material and 12
## 5 reference material 7
## 6 material ( 6
## 7 . material 6
## 8 copyrighted material 6
## 9 material supplementary 5
## 10 and methods 4
## 11 methods t 4
## 12 and method 4
## 13 material in 4
## 14 material for 4
## 15 material , 4
## 16 important material 4
## 17 material that 3
## 18 material material 3
## 19 material with 3
## 20 particulate material 3
## 21 composite material 3
## 22 material t 3
## 23 material ; 3
## 24 mesoporous material 3
## 25 material as 3
## 26 material of 3
## 27 material : 2
## 28 material available 2
## 29 material within 2
## 30 stent material 2
occurrences<-which(x$lemma=="methods")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 and methods 16
## 2 methods Material 7
## 3 Mech methods 7
## 4 methods for 6
## 5 methods . 4
## 6 methods , 4
## 7 , methods 3
## 8 . methods 3
## 9 methods 2013 2
## 10 in methods 2
## 11 methods novel 1
## 12 novel methods 1
## 13 methods this 1
## 14 methods Chemical 1
## 15 Immunol methods 1
## 16 methods 2010 1
## 17 methods Mol 1
## 18 Purification methods 1
## 19 methods 28 1
## 20 methods use 1
## 21 assessment methods 1
## 22 methods 2008;44:61e72 1
## 23 methods Nanomaterial 1
## 24 methods MATERIALS 1
## 25 Microscopy methods 1
## 26 methods Polystyrene 1
## 27 \fcount methods 1
## 28 methods silver 1
## 29 Emerge methods 1
## 30 methods and 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 and methods 16
## 2 methods Material 7
## 3 Mech methods 7
## 4 methods for 6
## 5 methods . 4
## 6 methods , 4
## 7 , methods 3
## 8 . methods 3
## 9 methods 2013 2
## 10 in methods 2
## 11 methods novel 1
## 12 novel methods 1
## 13 methods this 1
## 14 methods Chemical 1
## 15 Immunol methods 1
## 16 methods 2010 1
## 17 methods Mol 1
## 18 Purification methods 1
## 19 methods 28 1
## 20 methods use 1
## 21 assessment methods 1
## 22 methods 2008;44:61e72 1
## 23 methods Nanomaterial 1
## 24 methods MATERIALS 1
## 25 Microscopy methods 1
## 26 methods Polystyrene 1
## 27 \fcount methods 1
## 28 methods silver 1
## 29 Emerge methods 1
## 30 methods and 1
occurrences<-which(x$lemma=="method")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 . method 115
## 2 method for 88
## 3 method : 39
## 4 method to 34
## 5 method . 30
## 6 : method 20
## 7 method of 19
## 8 method in 17
## 9 method , 17
## 10 method method 15
## 11 method the 11
## 12 the method 11
## 13 method and 10
## 14 Mech method 9
## 15 sensitive method 8
## 16 a method 7
## 17 standard method 5
## 18 Analytical method 5
## 19 method with 5
## 20 nanotoxicity method 5
## 21 vitro method 5
## 22 method that 4
## 23 method use 4
## 24 Nat method 4
## 25 Statistical method 4
## 26 & method 4
## 27 this method 4
## 28 method Enzymol 4
## 29 ; method 4
## 30 simple method 4
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . method 115
## 2 method for 88
## 3 method : 39
## 4 method to 34
## 5 method . 30
## 6 : method 20
## 7 method of 19
## 8 method in 17
## 9 method , 17
## 10 method method 15
## 11 method the 11
## 12 the method 11
## 13 method and 10
## 14 Mech method 9
## 15 sensitive method 8
## 16 a method 7
## 17 test material 6
## 18 material , 6
## 19 standard method 5
## 20 Analytical method 5
## 21 method with 5
## 22 nanotoxicity method 5
## 23 vitro method 5
## 24 method that 4
## 25 method use 4
## 26 Nat method 4
## 27 Statistical method 4
## 28 & method 4
## 29 this method 4
## 30 method Enzymol 4
We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “Materials and Methods”. As before, we use co-occurrences to see how words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “Materials and Methods” section.
The function below, together with strip_corpus defined earlier, selects the last occurrence of a word in each document and retrieves the corresponding sentence. A graph showing the connections between the words of this subset of sentences is then plotted.
create_subset_corpus_last_lemmas <- function(index){
#this function is aimed to help construct a subset of x for the part of the analysis :
#Co-occurences for materials and material when it is the last lemma of the document
#x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
sentence_id<-occurrence$sentence_id
doc_id<-occurrence$doc_id
lemma<-occurrence$lemma
occurrences_in_doc=which(x$doc_id==doc_id & x$lemma==lemma)
last_occurrence=occurrences_in_doc[length(occurrences_in_doc)]
if (last_occurrence==index){return(strip_corpus(doc_id, sentence_id))}
return()
}
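As a compact illustration of the same selection, the last occurrence of a lemma in each document can be found with `duplicated(..., fromLast = TRUE)`. This is a minimal sketch on made-up data, assuming (as in the annotation `x`) that rows are stored in reading order; the data frame `toy` and the variable names are illustrative.

```r
# Toy annotation: two documents, rows in reading order, illustrative values only.
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1, 5, 9, 2, 7),
  lemma       = c("material", "material", "method", "material", "material"),
  stringsAsFactors = FALSE
)

# Row indices of all occurrences of the target lemma.
idx <- which(toy$lemma == "material")

# duplicated(..., fromLast = TRUE) flags every occurrence except the last
# one of each document, so negating it keeps only the last occurrences.
last_idx <- idx[!duplicated(toy$doc_id[idx], fromLast = TRUE)]

# Sentences that contain these last occurrences.
last_sentences <- toy[last_idx, c("doc_id", "sentence_id")]
```

The sentences listed in `last_sentences` could then be extracted in full, for example with strip_corpus, before computing the co-occurrences.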
occurrences<-which(x$lemma=="materials")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus_last_lemmas(index))
subset_corpus<-do.call(rbind, subset_corpus)
# The next three lines are commented out, so `stats` is not recomputed here:
# the plot and table below still show the result of the previous analysis.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when it is the last lemma of the document")
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when materials is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . method 115
## 2 method for 88
## 3 method : 39
## 4 method to 34
## 5 method . 30
## 6 : method 20
## 7 method of 19
## 8 method in 17
## 9 method , 17
## 10 method method 15
## 11 method the 11
## 12 the method 11
## 13 method and 10
## 14 Mech method 9
## 15 sensitive method 8
## 16 a method 7
## 17 test material 6
## 18 material , 6
## 19 standard method 5
## 20 Analytical method 5
## 21 method with 5
## 22 nanotoxicity method 5
## 23 vitro method 5
## 24 method that 4
## 25 method use 4
## 26 Nat method 4
## 27 Statistical method 4
## 28 & method 4
## 29 this method 4
## 30 method Enzymol 4
occurrences<-which(x$lemma=="material")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus_last_lemmas(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last lemma of the document")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material . 104
## 2 of material 97
## 3 material at 83
## 4 supplementary material 69
## 5 material be 55
## 6 material , 48
## 7 this material 40
## 8 the material 36
## 9 material and 35
## 10 nanotube material 27
## 11 material in 24
## 12 material : 24
## 13 material available 23
## 14 material for 21
## 15 material ( 17
## 16 nanosize material 14
## 17 genetic material 12
## 18 and material 12
## 19 test material 11
## 20 reference material 10
## 21 material refer 10
## 22 material from 9
## 23 t material 8
## 24 material as 8
## 25 a material 8
## 26 . material 8
## 27 material to 7
## 28 material on 7
## 29 nanoscale material 7
## 30 in material 6
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when material is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material . 104
## 2 of material 97
## 3 material at 83
## 4 supplementary material 69
## 5 material be 55
## 6 material , 48
## 7 this material 40
## 8 the material 36
## 9 material and 35
## 10 nanotube material 27
## 11 material in 24
## 12 material : 24
## 13 material available 23
## 14 material for 21
## 15 material ( 17
## 16 nanosize material 14
## 17 genetic material 12
## 18 and material 12
## 19 test material 11
## 20 reference material 10
## 21 material refer 10
## 22 and method 10
## 23 method , 10
## 24 material from 9
## 25 t material 8
## 26 material as 8
## 27 a material 8
## 28 . material 8
## 29 material to 7
## 30 material on 7
create_subset_corpus <- function(index, target){
#this function is aimed to help construct a subset of x for the part of the analysis :
#Co-occurences for lemma materials and material when they are the first lemma of a sentence
#x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
sentence_id<-occurrence$sentence_id
doc_id<-occurrence$doc_id
#the following line query the first lemma of the sentence in the good document
first_lemma<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[1],]$lemma
if (first_lemma==target) {return(strip_corpus(doc_id, sentence_id))}
return()
}
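The criterion “the target lemma is the first lemma of the sentence” can also be sketched in a vectorised way, using `duplicated()` on a sentence key to find the first token of each sentence. The data frame `toy` is made up for illustration and assumes rows are in reading order; it is not the code that produced the tables below.

```r
# Toy annotation: one document with two sentences, illustrative values only.
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc1", "doc1"),
  sentence_id = c(1, 1, 1, 2, 2, 2),
  lemma       = c("material", "and", "method", "the", "material", "be"),
  stringsAsFactors = FALSE
)

# One key per sentence; the first row seen for each key is its first token.
sentence_key <- paste(toy$doc_id, toy$sentence_id, sep = "|")
is_first <- !duplicated(sentence_key)

# Sentences whose first lemma is the target.
starts_with_target <- sentence_key[is_first & toy$lemma == "material"]

# Keep all tokens of those sentences, each sentence exactly once.
subset_toy <- toy[sentence_key %in% starts_with_target, ]
```

One difference from a per-occurrence loop is that each sentence is kept exactly once here, even when it contains the target lemma several times, which may change the co-occurrence counts slightly.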
occurrences<-which(x$lemma=="materials")
subset_corpus<-sapply(occurrences, function(index, target) create_subset_corpus(index, target),
target="materials")
subset_corpus<-do.call(rbind, subset_corpus)
# The next lines are commented out, so `stats` is not recomputed here:
# the plot and table below still show the result of the previous analysis.
# stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
# plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")
#
# head_cooc(stats, lemma="materials")
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material . 104
## 2 of material 97
## 3 material at 83
## 4 supplementary material 69
## 5 material be 55
## 6 material , 48
## 7 this material 40
## 8 the material 36
## 9 material and 35
## 10 nanotube material 27
## 11 material in 24
## 12 material : 24
## 13 material available 23
## 14 material for 21
## 15 material ( 17
## 16 nanosize material 14
## 17 genetic material 12
## 18 and material 12
## 19 test material 11
## 20 reference material 10
## 21 material refer 10
## 22 and method 10
## 23 method , 10
## 24 material from 9
## 25 t material 8
## 26 material as 8
## 27 a material 8
## 28 . material 8
## 29 material to 7
## 30 material on 7
occurrences<-which(x$lemma=="material")
subset_corpus<-sapply(occurrences, function(index, target) create_subset_corpus(index, target),
target="material")
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 . material 32
## 2 material and 27
## 3 t material 4
## 4 material be 4
## 5 material on 3
## 6 methods material 3
## 7 method material 3
## 8 material once 2
## 9 material treatment 2
## 10 material Implant 2
## 11 LDPE material 2
## 12 material after 2
## 13 material & 2
## 14 material , 2
## 15 material in 2
## 16 nano-scaled material 2
## 17 reagent material 1
## 18 Organism material 1
## 19 material with 1
## 20 material supply 1
## 21 characterization material 1
## 22 material characterization 1
## 23 validation material 1
## 24 Reagent material 1
## 25 study material 1
## 26 material composition 1
## 27 altered material 1
## 28 material investigation 1
## 29 Material material 1
## 30 material material 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when lemma material is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . material 32
## 2 material and 27
## 3 and method 14
## 4 and methods 10
## 5 method 2.1 6
## 6 t material 4
## 7 material be 4
## 8 material on 3
## 9 methods material 3
## 10 method material 3
## 11 material once 2
## 12 method Animal 2
## 13 material treatment 2
## 14 material Implant 2
## 15 LDPE material 2
## 16 material after 2
## 17 material & 2
## 18 & method 2
## 19 material , 2
## 20 material in 2
## 21 nano-scaled material 2
## 22 methods t 2
## 23 reagent material 1
## 24 method . 1
## 25 methods Silica 1
## 26 Organism material 1
## 27 material with 1
## 28 material supply 1
## 29 characterization material 1
## 30 material characterization 1